SeqArray - a storage-efficient high-performance data format for WGS variant calls
نویسندگان
چکیده
Motivation Whole-genome sequencing (WGS) data are being generated at an unprecedented rate. Analysis of WGS data requires a flexible data format to store the different types of DNA variation. Variant call format (VCF) is a general text-based format developed to store variant genotypes and their annotations. However, VCF files are large and data retrieval is relatively slow. Here we introduce a new WGS variant data format implemented in the R/Bioconductor package 'SeqArray' for storing variant calls in an array-oriented manner which provides the same capabilities as VCF, but with multiple high compression options and data access using high-performance parallel computing. Results Benchmarks using 1000 Genomes Phase 3 data show file sizes are 14.0 Gb (VCF), 12.3 Gb (BCF, binary VCF), 3.5 Gb (BGT) and 2.6 Gb (SeqArray) respectively. Reading genotypes in the SeqArray package are two to three times faster compared with the htslib C library using BCF files. For the allele frequency calculation, the implementation in the SeqArray package is over 5 times faster than PLINK v1.9 with VCF and BCF files, and over 16 times faster than vcftools. When used in conjunction with R/Bioconductor packages, the SeqArray package provides users a flexible, feature-rich, high-performance programming environment for analysis of WGS variant data. Availability and Implementation http://www.bioconductor.org/packages/SeqArray. Contact [email protected]. Supplementary information Supplementary data are available at Bioinformatics online.
منابع مشابه
A practical method to detect SNVs and indels from whole genome and exome sequencing data
The recent development of massively parallel sequencing technology has allowed the creation of comprehensive catalogs of genetic variation. However, due to the relatively high sequencing error rate for short read sequence data, sophisticated analysis methods are required to obtain high-quality variant calls. Here, we developed a probabilistic multinomial method for the detection of single nucle...
متن کاملSeqHBase: a big data toolset for family based sequencing data analysis
BACKGROUND Whole-genome sequencing (WGS) and whole-exome sequencing (WES) technologies are increasingly used to identify disease-contributing mutations in human genomic studies. It can be a significant challenge to process such data, especially when a large family or cohort is sequenced. Our objective was to develop a big data toolset to efficiently manipulate genome-wide variants, functional a...
متن کاملThe Use of Non-Variant Sites to Improve the Clinical Assessment of Whole-Genome Sequence Data
Genetic testing, which is now a routine part of clinical practice and disease management protocols, is often based on the assessment of small panels of variants or genes. On the other hand, continuous improvements in the speed and per-base costs of sequencing have now made whole exome sequencing (WES) and whole genome sequencing (WGS) viable strategies for targeted or complete genetic analysis,...
متن کاملAllele-specific copy-number discovery from whole-genome and whole-exome sequencing
Copy-number variants (CNVs) are a major form of genetic variation and a risk factor for various human diseases, so it is crucial to accurately detect and characterize them. It is conceivable that allele-specific reads from high-throughput sequencing data could be leveraged to both enhance CNV detection and produce allele-specific copy number (ASCN) calls. Although statistical methods have been ...
متن کاملNeed for speed in accurate whole-genome data analysis: GENALICE MAP challenges BWA/GATK more than PEMapper/PECaller and Isaac.
In the current high-throughput genomics era, efficient and accurate analysis of large-scale whole-genome sequencing (WGS) data constitutes a computational bottleneck. Johnston et al. (1) introduce the PEMapper/ PECaller software package for short-read WGS alignment and variant calling, promising faster analyses with reduced output file sizes and “nearly identical (or better)” variant calling ac...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- Bioinformatics
دوره 33 15 شماره
صفحات -
تاریخ انتشار 2017